Stock market prediction

The objective of this project is to predict stock returns from panel-type data.

Importing libraries

Let us load the required packages.

library(tidyverse) # loads dplyr, tidyr and ggplot2, among others
library(lightgbm)  # LightGBM model
library(corrplot)  # correlation plots
library(reshape2)  # data reshaping (melt/dcast)
library(forecast)  # time-series utilities
library(plotly)    # interactive plots

Functions used in this notebook

These are helper functions I created for use in this notebook.

drop_ghg_columns <- function(data) {
    data <- data[, !colnames(data) %in% c("ghg_s1", "ghg_s2", "ghg_s3", "year_month")]
    return(data)
}

z_score_normalization <- function(x) {
  # NA-safe: the dataset still contains missing values at this point
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
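As a quick sanity check, drop_ghg_columns() can be exercised on a toy one-row tibble (the toy data below is illustrative only; the function is restated so the snippet runs standalone):

```r
library(tibble)

# Restated from above so the snippet runs standalone
drop_ghg_columns <- function(data) {
  data[, !colnames(data) %in% c("ghg_s1", "ghg_s2", "ghg_s3", "year_month")]
}

toy <- tibble(
  ghg_s1 = 1, ghg_s2 = 2, ghg_s3 = 3,
  year_month = "1995-12", revenue = 10
)

colnames(drop_ghg_columns(toy)) # "revenue"
```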

Data wrangling

Let us load the dataset, which comes in RData format.

load("data/stocks_clean.RData") # Loading the dataset

df <- stocks_clean # Assigning the dataset to a variable df
  
rm(stocks_clean) # now that df is assigned, remove stocks_clean to save memory

Dataset description

Let us have a first look at the dataset df.

dim(df) # checking the dimensions of the dataset
[1] 289271     13
head(df) # Viewing the first few observations
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue ghg_s1 ghg_s2 ghg_s3 return
AAON US Equity 1995-12-31 0.5048 35.1440 2.5948 85.9073 0.8628 71.728 14.720 NA NA NA -0.0980883
AAON US Equity 1996-01-31 0.4719 32.8520 2.4256 85.9073 3.0722 63.087 67.346 NA NA NA -0.0651743
AAON US Equity 1996-02-29 0.5048 35.1440 2.5948 85.9073 3.0722 97.639 67.346 NA NA NA 0.0697182
AAON US Equity 1996-03-31 0.4170 29.0367 2.0805 65.1878 3.1180 100.450 13.438 NA NA NA -0.1739303
AAON US Equity 1996-04-30 0.3841 26.7444 1.9162 65.1878 3.1180 76.133 13.438 NA NA NA -0.0788969
AAON US Equity 1996-05-31 0.3951 27.5445 1.9710 65.1878 3.1180 88.304 13.438 NA NA NA 0.0286384
tail(df) # Viewing the last few observations 
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue ghg_s1 ghg_s2 ghg_s3 return
ZIXI US Equity 2022-07-29 8.485 481.8671 14.1059 146.8623 -3.7101 4.118 64.85 NA NA NA 0
ZIXI US Equity 2022-08-31 8.485 481.8671 14.1059 146.8623 -3.7101 4.118 64.85 NA NA NA 0
ZIXI US Equity 2022-09-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-10-31 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-11-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-12-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
sapply(df, class) # Checking the dtypes of each column
        ticker           date          price     market_cap  price_to_book 
   "character"         "Date"      "numeric"      "numeric"      "numeric" 
debt_to_equity  profitability     volatility        revenue         ghg_s1 
     "numeric"      "numeric"      "numeric"      "numeric"      "numeric" 
        ghg_s2         ghg_s3         return 
     "numeric"      "numeric"      "numeric" 

Variables definition

  • ticker (char): Unique identifier for each company listed on the stock exchange, represented by a series of letters.

  • date (date): Date of recorded value.

  • price (numeric, dbl): Stock price on the given date.

  • market_cap (numeric, dbl): Total market value of a company’s outstanding shares of stock.

  • price_to_book (numeric, dbl): Ratio compares a company’s market value to its book value, providing insights into how the market values the company versus its actual net asset value.

  • debt_to_equity (numeric, dbl): Measures a company’s financial leverage by comparing its total liabilities to shareholders’ equity.

  • profitability (numeric, dbl): Refers to various metrics that measure a company’s ability to generate earnings relative to its revenue, assets, equity, or other financial metrics.

  • volatility (numeric, dbl): Stock’s price volatility, which is a measure of the dispersion of returns for a given stock.

  • revenue (numeric, dbl): Total income generated by the company from its business activities before any expenses are deducted.

  • ghg_s1, ghg_s2, ghg_s3 (numeric, dbl): These columns refer to greenhouse gas emissions, categorized by scope 1, scope 2, and scope 3 emissions. Scope 1 covers direct emissions from owned or controlled sources, scope 2 covers indirect emissions from the generation of purchased electricity, and scope 3 includes all other indirect emissions that occur in a company’s value chain.

  • return (numeric, dbl): Stock’s return over a specific period. This is the value that we need to predict.

# Calculating number of unique companies/tickers

print(paste("Number of unique companies:", length(unique(df$ticker))))
[1] "Number of unique companies: 885"
# Calculating duration

min_date <- min(df$date)
max_date <- max(df$date)
duration_months <- round(as.numeric(max_date - min_date) / 30.44) # ~30.44 days per month on average
print(paste("Total duration from", min_date, "to", max_date, "that is approximately", duration_months, "months"))
[1] "Total duration from 1995-12-29 to 2023-03-31 that is approximately 327 months"
# Checking if time series is complete

df$year_month <- format(df$date, "%Y-%m")

min_date <- min(df$date)
max_date <- max(df$date)

full_sequence <- seq(from = as.Date(format(min_date, "%Y-%m-01")), 
                     to = as.Date(format(max_date, "%Y-%m-01")), 
                     by = "month")

full_year_month <- format(full_sequence, "%Y-%m")

missing_months <- setdiff(full_year_month, df$year_month)
# Used chatgpt for this code

length(missing_months)
[1] 0

The time series is complete.

The dataset contains monthly observations of financial data for 885 US firms (tickers) from December 1995 to March 2023, i.e. roughly 27 years (~327 months), with ~289k rows and 13 columns.

Checking whether any ticker drops out of the data before the last two years, as that will be useful when splitting.

x <- max_date - lubridate::years(2)

df |>
  group_by(ticker) |>
  summarise(last_entry = max(date, na.rm = TRUE)) |>
  filter(last_entry <= x) |>
  pull(ticker)  
[1] "CR US Equity"
# Removing all rows with "CR US Equity" ticker

df <- df |>
  filter(ticker != "CR US Equity")

Structure and statistics of the dataset

Structure

str(df[,1:13])
tibble [289,244 × 13] (S3: tbl_df/tbl/data.frame)
 $ ticker        : chr [1:289244] "AAON US Equity" "AAON US Equity" "AAON US Equity" "AAON US Equity" ...
 $ date          : Date[1:289244], format: "1995-12-31" "1996-01-31" ...
 $ price         : num [1:289244] 0.505 0.472 0.505 0.417 0.384 ...
 $ market_cap    : num [1:289244] 35.1 32.9 35.1 29 26.7 ...
 $ price_to_book : num [1:289244] 2.59 2.43 2.59 2.08 1.92 ...
 $ debt_to_equity: num [1:289244] 85.9 85.9 85.9 65.2 65.2 ...
 $ profitability : num [1:289244] 0.863 3.072 3.072 3.118 3.118 ...
 $ volatility    : num [1:289244] 71.7 63.1 97.6 100.5 76.1 ...
 $ revenue       : num [1:289244] 14.7 67.3 67.3 13.4 13.4 ...
 $ ghg_s1        : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s2        : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s3        : num [1:289244] NA NA NA NA NA NA NA NA NA NA ...
 $ return        : num [1:289244] -0.0981 -0.0652 0.0697 -0.1739 -0.0789 ...

Apart from ticker (a character column) and date (a Date column), all columns in this dataset are numeric. Also, since the forecasting will be company-dependent, ticker is important.
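Since LightGBM (loaded above) only accepts numeric features, the ticker column will eventually need an integer encoding; a minimal sketch (the tickers below are just examples from the data):

```r
# Map each ticker to a stable integer code; LightGBM can treat
# integer-coded columns as categorical features
tickers <- c("AAON US Equity", "ZIXI US Equity", "AAON US Equity")
ticker_id <- as.integer(factor(tickers))
ticker_id # 1 2 1 (codes follow alphabetical factor levels)
```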

Statistics

statistics <- summary(df)
statistics
    ticker               date                price           market_cap       
 Length:289244      Min.   :1995-12-29   Min.   :   0.00   Min.   :      0.0  
 Class :character   1st Qu.:2002-10-31   1st Qu.:  12.37   1st Qu.:    427.9  
 Mode  :character   Median :2009-08-31   Median :  24.60   Median :   1783.9  
                    Mean   :2009-08-16   Mean   :  47.53   Mean   :  14358.9  
                    3rd Qu.:2016-06-30   3rd Qu.:  45.85   3rd Qu.:   8226.0  
                    Max.   :2023-03-31   Max.   :5908.87   Max.   :2913283.9  
                                                           NA's   :3973       
 price_to_book      debt_to_equity      profitability        volatility     
 Min.   :0.00e+00   Min.   :      0.0   Min.   :-3300700   Min.   :   0.00  
 1st Qu.:1.32e+00   1st Qu.:     18.6   1st Qu.:       3   1st Qu.:  21.70  
 Median :2.04e+00   Median :     52.9   Median :       7   Median :  30.71  
 Mean   :6.92e+00   Mean   :    203.0   Mean   :    -238   Mean   :  39.13  
 3rd Qu.:3.42e+00   3rd Qu.:    107.6   3rd Qu.:      15   3rd Qu.:  45.38  
 Max.   :1.24e+05   Max.   :1188136.9   Max.   :  370300   Max.   :5137.32  
 NA's   :20126      NA's   :10061       NA's   :6075       NA's   :160      
    revenue             ghg_s1              ghg_s2             ghg_s3         
 Min.   :  -27966   Min.   :     0.00   Min.   :    0.00   Min.   :      0.0  
 1st Qu.:     100   1st Qu.:    31.29   1st Qu.:   66.94   1st Qu.:     54.9  
 Median :     444   Median :   212.20   Median :  279.61   Median :    549.5  
 Mean   :   19534   Mean   :  5209.68   Mean   : 1138.86   Mean   :  30353.2  
 3rd Qu.:    2089   3rd Qu.:  2140.41   3rd Qu.: 1000.00   3rd Qu.:   9130.0  
 Max.   :31379507   Max.   :145500.00   Max.   :29000.00   Max.   :1169970.0  
 NA's   :5078       NA's   :251669      NA's   :253527     NA's   :264840     
     return           year_month       
 Min.   :-0.999900   Length:289244     
 1st Qu.:-0.046082   Class :character  
 Median : 0.007174   Mode  :character  
 Mean   : 0.011709                     
 3rd Qu.: 0.061675                     
 Max.   : 4.140351                     
                                       

Points to note:

  1. There are outliers in some of the variables, as the maximum is considerably higher than the upper quartile.
  2. There are missing values in most of the columns, with especially high counts in all ghg columns.
  3. Negative values in profitability and revenue need to be investigated as they could be anomalies.
  4. Normalization might be required due to the wide range of values across columns.
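On the normalization point, a z-score per column is one option; a minimal sketch on a toy vector (z_score here skips NAs via na.rm, since the columns still contain missing values at this stage):

```r
# NA-safe z-score: ignore missing values when computing mean/sd
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

x <- c(10, 20, 30, NA)
z <- z_score(x)
round(z, 2) # -1 0 1 NA
```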
cat("Percentage of negative revenue values:", (sum(df$revenue < 0, na.rm = TRUE) / nrow(df)) * 100, "%\n")
Percentage of negative revenue values: 0.04529048 %
cat("Number of tickers with negative revenue:", length(unique(df[df$revenue < 0, ]$ticker)), "\n")
Number of tickers with negative revenue: 22 
negative_revenue <- aggregate(revenue ~ ticker, data = df, function(x) mean(x < 0) * 100)

# Checking the percentage of values that are negative for each ticker
negative_revenue[negative_revenue$revenue > 0, ]
ticker revenue
22 AIG US Equity 0.3048780
96 BERK US Equity 0.4184100
104 BK US Equity 0.9146341
107 BLX US Equity 0.9146341
116 BPT US Equity 1.2195122
165 CINF US Equity 0.9146341
178 CMS US Equity 0.3048780
205 CSWC US Equity 14.8606811
220 CVU US Equity 0.9146341
300 FMCC US Equity 1.2195122
302 FNMA US Equity 0.3048780
383 IEP US Equity 0.9146341
464 MBI US Equity 8.8414634
504 MSB US Equity 0.9174312
566 OFG US Equity 0.3048780
643 RCL US Equity 0.9146341
644 RDN US Equity 2.4390244
648 RES US Equity 0.3048780
725 STRS US Equity 0.3048780
865 WTM US Equity 2.1341463
876 Y US Equity 0.9146341
aue_df <- df[df$ticker == "AIG US Equity", ]
ggplot(aue_df, aes(x = date, y = revenue)) +
  geom_line(linewidth = 0.2, color = "blue") +
  labs(title = "Revenue Over Time for AIG US Equity",
       x = "Date",
       y = "Revenue") +
  theme_minimal()

aue_df$date[aue_df$revenue < 0]
[1] "2008-12-31"

Apparently, there was a scandal around this time, during the 2008 financial crisis, when AIG’s revenue collapsed. Checking the same for another company, CSWC US Equity, which has about 14% negative revenue values.

cue_df <- df[df$ticker == "CSWC US Equity", ]
ggplot(cue_df, aes(x = date, y = revenue)) +
  geom_line(linewidth = 0.2, color = "blue") +
  labs(title = "Revenue Over Time for CSWC US Equity",
       x = "Date",
       y = "Revenue") +
  theme_minimal()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_line()`).

cue_df$date[cue_df$revenue < 0]
 [1] NA           NA           NA           NA           NA          
 [6] "1998-09-30" "1998-10-30" "1998-11-30" "1999-03-31" "1999-04-30"
[11] "1999-05-31" "1999-09-30" "1999-10-29" "1999-11-30" "2000-03-31"
[16] "2000-04-28" "2000-05-31" "2000-09-30" "2000-10-31" "2000-11-30"
[21] "2000-12-31" "2001-01-31" "2001-02-28" "2001-04-30" "2001-05-31"
[26] "2001-09-30" "2001-10-31" "2001-11-30" "2002-06-30" "2002-07-31"
[31] "2002-08-30" "2002-09-30" "2002-10-31" "2002-11-29" "2003-04-30"
[36] "2003-05-30" "2004-06-30" "2004-07-30" "2004-08-31" "2004-09-30"
[41] "2004-10-29" "2004-11-30" "2006-06-30" "2006-07-31" "2006-08-31"
[46] "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31" "2008-01-31"
[51] "2008-02-29" "2008-04-30" "2008-05-30"

This is spread across time.

For the companies where the negative values are less than 2% of observations, I will replace them with 0; the remaining two companies, “MBI US Equity” and “CSWC US Equity”, I will drop from the dataset.

# Dropping two companies
df <- df |> 
  filter(ticker != "MBI US Equity" & ticker != "CSWC US Equity")
df <- df |>
  mutate(revenue = ifelse(revenue < 0, 0, revenue))

Missing values and duplicates

Missing values

# Calculating the null value percentage in each column
round(colSums(is.na(df)) / nrow(df) * 100, 2)
        ticker           date          price     market_cap  price_to_book 
          0.00           0.00           0.00           1.37           6.97 
debt_to_equity  profitability     volatility        revenue         ghg_s1 
          3.48           2.10           0.06           1.76          86.98 
        ghg_s2         ghg_s3         return     year_month 
         87.62          91.54           0.00           0.00 
# Calculating null values in ghg columns post 2011

subset_df <- df |> 
  subset(date > as.Date("2011-12-31"))

round(colSums(is.na(subset_df)) / nrow(subset_df) * 100, 2)
        ticker           date          price     market_cap  price_to_book 
          0.00           0.00           0.00           0.59           7.22 
debt_to_equity  profitability     volatility        revenue         ghg_s1 
          3.80           1.87           0.06           1.47          73.27 
        ghg_s2         ghg_s3         return     year_month 
         74.66          81.93           0.00           0.00 

There are a few null values in the market_cap, price_to_book, debt_to_equity, profitability, volatility and revenue columns; however, all ghg columns have more than 80% null values. Even though GHG reporting started around 2011, the ghg columns are still 70-80% null even after that date.

# Used gpt for this

# Step 1 & 2: Check for non-missing values and summarize
ticker_summary <- subset_df %>% #subset_df is df after 2011
  group_by(ticker) %>%
  summarise(
    has_ghg_s1 = any(!is.na(ghg_s1)),
    has_ghg_s2 = any(!is.na(ghg_s2)),
    has_ghg_s3 = any(!is.na(ghg_s3))
  )

# Step 3: Calculate the percentage of tickers with non-missing values
percentage_ghg_s1 = sum(ticker_summary$has_ghg_s1) / nrow(ticker_summary) * 100
percentage_ghg_s2 = sum(ticker_summary$has_ghg_s2) / nrow(ticker_summary) * 100
percentage_ghg_s3 = sum(ticker_summary$has_ghg_s3) / nrow(ticker_summary) * 100

# Print the results
print(paste("Percentage of tickers with at least one non-missing GHG S1 value:", percentage_ghg_s1))
[1] "Percentage of tickers with at least one non-missing GHG S1 value: 54.3083900226757"
print(paste("Percentage of tickers with at least one non-missing GHG S2 value:", percentage_ghg_s2))
[1] "Percentage of tickers with at least one non-missing GHG S2 value: 53.4013605442177"
print(paste("Percentage of tickers with at least one non-missing GHG S3 value:", percentage_ghg_s3))
[1] "Percentage of tickers with at least one non-missing GHG S3 value: 39.9092970521542"

I checked whether at least one GHG S1 value is present for each company, with the idea of imputing each company’s missing values with that same value. This isn’t really the right approach, since greenhouse gas emissions change every year and depend on many factors for each scope; and it wouldn’t work anyway, because roughly 46-60% of companies (depending on the scope) have no ghg values at all.

Unable to find a workable solution for ghg_s1, ghg_s2 and ghg_s3, I have decided to drop these columns for the purpose of this project.

# dropping ghg columns

df_cleaner <- drop_ghg_columns(df)
head(df_cleaner)
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue return
AAON US Equity 1995-12-31 0.5048 35.1440 2.5948 85.9073 0.8628 71.728 14.720 -0.0980883
AAON US Equity 1996-01-31 0.4719 32.8520 2.4256 85.9073 3.0722 63.087 67.346 -0.0651743
AAON US Equity 1996-02-29 0.5048 35.1440 2.5948 85.9073 3.0722 97.639 67.346 0.0697182
AAON US Equity 1996-03-31 0.4170 29.0367 2.0805 65.1878 3.1180 100.450 13.438 -0.1739303
AAON US Equity 1996-04-30 0.3841 26.7444 1.9162 65.1878 3.1180 76.133 13.438 -0.0788969
AAON US Equity 1996-05-31 0.3951 27.5445 1.9710 65.1878 3.1180 88.304 13.438 0.0286384
# Checking if null values are specific to a few tickers

tickers_always_null <- df_cleaner |>
  group_by(ticker) |>
  summarise(
    all_na_market_cap = all(is.na(market_cap)),
    all_na_price_to_book = all(is.na(price_to_book)),
    all_na_debt_to_equity = all(is.na(debt_to_equity)),
    all_na_profitability = all(is.na(profitability)),
    all_na_volatility = all(is.na(volatility)),
    all_na_revenue = all(is.na(revenue))
  ) |>
  filter(
    all_na_market_cap | 
    all_na_price_to_book | 
    all_na_debt_to_equity | 
    all_na_profitability | 
    all_na_volatility | 
    all_na_revenue
  ) |>
  pull(ticker)
# tickers_always_null now contains tickers with always null values in specified columns
length(tickers_always_null)
[1] 40

There are 40 tickers for which at least one of these columns is entirely missing, so for those tickers there is no way to impute the null values within the scope of this project. I could impute them with 0, but imputing with an arbitrary value would introduce bias, so I will drop these tickers instead.

tickers_always_null
 [1] "ADX US Equity"  "AEM US Equity"  "ASA US Equity"  "AU US Equity"  
 [5] "AZN US Equity"  "BBVA US Equity" "BCE US Equity"  "BCS US Equity" 
 [9] "BHP US Equity"  "BP US Equity"   "CCU US Equity"  "CET US Equity" 
[13] "ENB US Equity"  "GAM US Equity"  "GFI US Equity"  "GILT US Equity"
[17] "GSK US Equity"  "HFC US Equity"  "HMC US Equity"  "IMO US Equity" 
[21] "KOF US Equity"  "NVO US Equity"  "PEO US Equity"  "PHG US Equity" 
[25] "PHI US Equity"  "RIO US Equity"  "SAN US Equity"  "SQM US Equity" 
[29] "SSL US Equity"  "TEF US Equity"  "TEVA US Equity" "TGA US Equity" 
[33] "TGB US Equity"  "TM US Equity"   "TRP US Equity"  "TRXC US Equity"
[37] "UL US Equity"   "WBK US Equity"  "WPP US Equity"  "YPF US Equity" 

Dropping the data for these 40 tickers, which is about 4.5% of all tickers.

df_cleanest <- df_cleaner |>
  filter(!(ticker %in% tickers_always_null))

round(colSums(is.na(df_cleanest)) / nrow(df_cleanest) * 100, 4)
        ticker           date          price     market_cap  price_to_book 
        0.0000         0.0000         0.0000         0.3517         2.7152 
debt_to_equity  profitability     volatility        revenue         return 
        2.8824         1.4113         0.0508         1.0650         0.0000 

There are still some missing values present in market_cap, price_to_book, debt_to_equity, profitability, volatility and revenue columns.

For these values, I will impute each missing entry per ticker with the next available value in that column, falling back to the previous one (tidyr’s “updown” fill direction).

df_cleanmax <- df_cleanest |>
  group_by(ticker) |>
  fill(everything(), .direction = "updown")

round(colSums(is.na(df_cleanmax)) / nrow(df_cleanmax) * 100, 4)
        ticker           date          price     market_cap  price_to_book 
             0              0              0              0              0 
debt_to_equity  profitability     volatility        revenue         return 
             0              0              0              0              0 

There are no missing values in our df_cleanmax dataset.

Duplicates

Checking for duplicates.

duplicates <- df_cleanmax[duplicated(df_cleanmax), ]
nrow(duplicates) # length() would count the 10 columns, not the duplicated rows

Any duplicated rows are removed with distinct() below.

df_cleanmaxpro <- distinct(df_cleanmax)

Exploratory data analysis

stocks <- df_cleanmaxpro

Univariate analysis

stocks |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of Price",
       x = "Price") +
  theme_light()

stocks |>
  ggplot(aes(x = price)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of Price",
       x = "Price") +
  theme_light() + 
  theme(axis.text.x = element_blank())

# Function to detect outliers 
detect_outliers <- function(data, variable, threshold = 1.5) {
  q1 <- quantile(data[[variable]], 0.25)
  q3 <- quantile(data[[variable]], 0.75)
  
  iqr <- q3 - q1
  
  lower_bound <- q1 - threshold * iqr
  upper_bound <- q3 + threshold * iqr

  outliers <- data[data[[variable]] < lower_bound | data[[variable]] > upper_bound, "ticker"]
  
  return(outliers)
}

# Detect outliers for the price
outliers <- detect_outliers(stocks, "price")

table(outliers)
ticker
 AAPL US Equity  ABMD US Equity   ABT US Equity  ADBE US Equity   ADI US Equity 
             33              84              32              82              52 
  ADM US Equity   ADP US Equity  ADSK US Equity   AEP US Equity   AFG US Equity 
              1              76              71               6              57 
 AGCO US Equity   AIG US Equity   AIN US Equity   AIT US Equity   AJG US Equity 
             27             153               4              17              36 
  ALG US Equity   ALK US Equity   ALL US Equity   ALX US Equity  AMAT US Equity 
             55               1              47             235              22 
 AMGN US Equity  AMSC US Equity  AMWD US Equity    AN US Equity  ANAT US Equity 
            121             145              13              23             167 
  AON US Equity   APA US Equity   APD US Equity  ARCB US Equity   ARW US Equity 
             89              35             116               2              27 
 ASGN US Equity   ASH US Equity   ATO US Equity   ATR US Equity  ATRI US Equity 
             13              14              41              55             183 
 ATVI US Equity   AVY US Equity   AWR US Equity   AXP US Equity   AXR US Equity 
              1              64               2              57               2 
  AZO US Equity    BA US Equity  BANF US Equity  BBSI US Equity   BBY US Equity 
            202             123               3               2              19 
   BC US Equity  BCPC US Equity   BDX US Equity    BH US Equity  BIIB US Equity 
              5              47             116             315             139 
  BIO US Equity   BMI US Equity  BOKF US Equity  BPOP US Equity   BPT US Equity 
            150              17              16             130              27 
 BXMT US Equity     C US Equity  CACC US Equity  CACI US Equity   CAR US Equity 
            141             152             126              83              19 
 CASY US Equity   CAT US Equity    CB US Equity   CBB US Equity  CBRL US Equity 
             91              93             112              29             110 
  CCF US Equity   CCK US Equity   CDE US Equity  CDNS US Equity   CDR US Equity 
             48              16              26              33              58 
  CFR US Equity  CHCO US Equity   CHD US Equity  CHDN US Equity   CHE US Equity 
             43               3               5              50             105 
   CI US Equity  CINF US Equity  CLDX US Equity   CLF US Equity   CLH US Equity 
            102              33             202               5              16 
  CLX US Equity   CMA US Equity   CMI US Equity   CMO US Equity CNBKA US Equity 
            102               2             136               7              21 
 CNMD US Equity   CNR US Equity  COKE US Equity   COO US Equity   COP US Equity 
             26              86              99             123              11 
 COST US Equity   CPF US Equity   CPK US Equity   CPT US Equity   CRK US Equity 
            128             161              30              42              95 
 CRMT US Equity  CRUS US Equity  CRVL US Equity   CSL US Equity  CTAS US Equity 
             23               2              28              86              82 
  CUZ US Equity   CVM US Equity   CVS US Equity   CVX US Equity    CW US Equity 
              2             171              19             124              66 
   DD US Equity   DDS US Equity    DE US Equity   DHI US Equity   DHR US Equity 
              8              41              78               5              63 
  DIN US Equity  DIOD US Equity   DIS US Equity  DORM US Equity   DOV US Equity 
              9               2              88              13              42 
  DTE US Equity   DUK US Equity   DVN US Equity    DX US Equity    DY US Equity 
             48              20               6              34              10 
   EA US Equity   ECL US Equity    ED US Equity   EFX US Equity   EGP US Equity 
             61             115               4              93              53 
  EMR US Equity   EOG US Equity   ESE US Equity   ETN US Equity   ETR US Equity 
              3              42               6              32              60 
 EXPD US Equity  EXPO US Equity   FBP US Equity  FCFS US Equity FCNCA US Equity 
             24              14             124               5             267 
  FDX US Equity  FICO US Equity  FISV US Equity   FMC US Equity  FOSL US Equity 
            148              86              40              34              30 
  FRT US Equity    GD US Equity    GE US Equity  GILD US Equity   GLW US Equity 
            119             111             234              16               2 
  GPC US Equity   GWW US Equity   HAE US Equity   HAS US Equity    HD US Equity 
             51             160              25              23             102 
  HEI US Equity  HELE US Equity   HES US Equity   HON US Equity   HOV US Equity 
             45              65              23              93             158 
   HP US Equity   HRC US Equity   HSY US Equity   HUM US Equity  IBCP US Equity 
              8              42              89             113              95 
  IBM US Equity  ICUI US Equity   IDA US Equity  IDXX US Equity   IEP US Equity 
            213              91              42              80              23 
  IEX US Equity   IFF US Equity  IMCI US Equity IMKTA US Equity  INTU US Equity 
             72             104              76               1              92 
 IPAR US Equity  ITIC US Equity   ITT US Equity   ITW US Equity  JACK US Equity 
              5              81               2              87              21 
 JBHT US Equity  JBSS US Equity  JJSF US Equity  JKHY US Equity   JNJ US Equity 
             64               2             102              71             107 
 JOUT US Equity   JPM US Equity   KAI US Equity   KEX US Equity  KLAC US Equity 
             13              61              48              11              67 
  KMB US Equity   KSU US Equity   KWR US Equity  LANC US Equity   LEE US Equity 
            115              92              80              92             148 
  LEN US Equity  LFUS US Equity   LLY US Equity   LNN US Equity   LOW US Equity 
             11              95              59              39              52 
 LRCX US Equity  LSTR US Equity   MAN US Equity  MATX US Equity   MCD US Equity 
             77              64              23               3             108 
  MDT US Equity   MGA US Equity  MGPI US Equity  MGRC US Equity   MHK US Equity 
             35               1               9               4             118 
 MIDD US Equity   MKC US Equity   MKL US Equity  MLAB US Equity   MMC US Equity 
             92               5             315              91              45 
  MMM US Equity   MOS US Equity   MRK US Equity    MS US Equity   MSA US Equity 
            124               7               6               6              56 
 MSEX US Equity  MSFT US Equity   MSI US Equity   MTB US Equity   MTH US Equity 
              9              59              84             163              14 
 MTRN US Equity   MTZ US Equity  MXIM US Equity  NATH US Equity  NAVB US Equity 
              2               6              20               1              28 
  NBR US Equity  NDSN US Equity   NEU US Equity   NKE US Equity   NOC US Equity 
            305              80             158              34             114 
  NPK US Equity   NSC US Equity  NTRS US Equity   NTZ US Equity   NUE US Equity 
             52              90              35              39              22 
  NVR US Equity  NWLI US Equity  ODFL US Equity   ODP US Equity  ORLY US Equity 
            269             281              55             137             123 
  OSK US Equity   OXM US Equity   OXY US Equity  PARD US Equity  PAYX US Equity 
             15               8               8             162              25 
  PEP US Equity    PG US Equity   PGR US Equity    PH US Equity   PII US Equity 
             92              50              20             117              75 
  PKI US Equity  PLXS US Equity   PNC US Equity   PNW US Equity   PPG US Equity 
             36               5              77               1              99 
 PRGO US Equity   PRK US Equity   PSA US Equity   PSB US Equity   PTC US Equity 
             52             103             155              84              31 
  PVH US Equity  PZZA US Equity  QCOM US Equity  QDEL US Equity     R US Equity 
             80              12              33              28               1 
  RAD US Equity   RBC US Equity   RCL US Equity  REGN US Equity  RELV US Equity 
            126              72              37             134               1 
  RGA US Equity  RGEN US Equity  RGLD US Equity   RHI US Equity   RIG US Equity 
             75              37              41              10              33 
  RJF US Equity   RLI US Equity   ROG US Equity   ROP US Equity  ROST US Equity 
             16              28              71             133              33 
  RPM US Equity  SAFM US Equity  SANM US Equity  SBCF US Equity  SBUX US Equity 
              3              74              32              39              19 
  SCL US Equity   SEB US Equity   SFE US Equity   SHW US Equity   SIG US Equity 
             37             328              19              77              28 
 SIGI US Equity   SJM US Equity   SLB US Equity   SMG US Equity   SNA US Equity 
              2             118              14              39             115 
 SNPS US Equity   SNV US Equity   SPB US Equity   STE US Equity  STLY US Equity 
             52               5              21              59             139 
  STT US Equity   SWK US Equity  SWKS US Equity   SXI US Equity   SXT US Equity 
              6              86              50              27               2 
  SYK US Equity   TAP US Equity  TARO US Equity  TECH US Equity   TER US Equity 
             89               9              63              11              24 
  TFX US Equity   TGT US Equity   THC US Equity   THO US Equity  TISI US Equity 
            111              43              28              30             184 
  TMO US Equity   TPL US Equity  TROW US Equity   TRV US Equity   TTC US Equity 
            114             113              62             101              16 
 TTEK US Equity   TXN US Equity   TYL US Equity   UFI US Equity   UHS US Equity 
             30              63             105              13             103 
  UHT US Equity   UIS US Equity  UMBF US Equity   UNF US Equity   UNH US Equity 
              8              79               6             113             101 
  UNP US Equity   USM US Equity  USPH US Equity  UTMD US Equity   VAR US Equity 
             92               4              40              12              67 
  VFC US Equity   VHI US Equity  VICR US Equity   VMC US Equity   VMI US Equity 
              1              43               8              93             144 
 VRTX US Equity   WDC US Equity  WDFC US Equity   WEC US Equity   WHR US Equity 
             89              10              89              10             131 
 WIRE US Equity    WM US Equity   WMT US Equity  WRLD US Equity   WSM US Equity 
             18              49              55              46              29 
  WSO US Equity   WST US Equity   WTM US Equity   WTS US Equity     X US Equity 
            107              62             315              33              17 
 XLNX US Equity   XOM US Equity   XRX US Equity     Y US Equity  ZBRA US Equity 
             43              13              28             316              74 
length(table(outliers))
[1] 360

360 of the tickers — roughly half — contain outlier values in at least one variable. Since the data cover many different tickers, I will treat these outliers per ticker and per variable.
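The per-ticker treatment that follows repeats the same pattern several times. As a sketch, it could be factored into a base-R helper like the hypothetical `cap_upper_with_median` below (not part of the original notebook), which replaces a ticker's values above an upper quantile with that ticker's median:

```r
# Hypothetical helper: for one ticker and one column, replace values above
# the given upper quantile with that ticker's median (base R only).
cap_upper_with_median <- function(data, tick, col, upper = 0.9) {
  rows <- data$ticker == tick
  vals <- data[[col]][rows]
  cutoff <- quantile(vals, probs = upper, na.rm = TRUE)
  med <- median(vals, na.rm = TRUE)
  idx <- which(rows & data[[col]] > cutoff)  # which() drops NAs safely
  data[[col]][idx] <- med
  data
}
# Usage sketch: stocks <- cap_upper_with_median(stocks, "CVM US Equity", "price_to_book")
```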

stocks |>
  ggplot(aes(x = market_cap)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of market_cap",
       x = "market_cap") +
  theme_light()

stocks |>
  ggplot(aes(x = market_cap)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of market_cap",
       x = "market_cap") +
  theme_light() + 
  theme(axis.text.x = element_blank())

stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of price_to_book",
       x = "price_to_book") +
  theme_light()

stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of price_to_book",
       x = "price_to_book") +
  theme_light() + 
  theme(axis.text.x = element_blank())

cat("Company with the max price_to_book:", stocks$ticker[which.max(stocks$price_to_book)], "\n")
Company with the max price_to_book: CVM US Equity 
# Checking the price_to_book box plot for this company
cvm_stocks <- subset(stocks, ticker == "CVM US Equity")

cvm_stocks |>
  ggplot(aes(x = price_to_book)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of price_to_book",
       x = "price_to_book") +
  theme_light() + 
  theme(axis.text.x = element_blank())

I will replace these extreme values with the company's median price_to_book, as they appear to be outliers.

# Calculating the median price_to_book 
median_p <- median(stocks$price_to_book[stocks$ticker == "CVM US Equity"], na.rm = TRUE)

# Calculating the 90th percentile price_to_book 
percentile_90 <- quantile(stocks$price_to_book[stocks$ticker == "CVM US Equity"], probs = 0.9, na.rm = TRUE)

# Finding the indices of price_to_book values above the 90th percentile 
upper_10_idx <- which(stocks$ticker == "CVM US Equity" & stocks$price_to_book > percentile_90)

# Imputing the upper 10% price_to_book values with the median for CVM US Equity
stocks$price_to_book[upper_10_idx] <- median_p
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of debt_to_equity",
       x = "debt_to_equity") +
  theme_light()

stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() + 
  theme(axis.text.x = element_blank())

cat("Company with the max debt_to_equity:", stocks$ticker[which.max(stocks$debt_to_equity)], "\n")
Company with the max debt_to_equity: FNMA US Equity 
# Checking the debt_to_equity box plot for this company
fue_stocks <- subset(stocks, ticker == "FNMA US Equity")

fue_stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() + 
  theme(axis.text.x = element_blank())

I will replace this value with the company's median debt_to_equity, as it appears to be an outlier.

# Calculating the median debt_to_equity 
median_dte <- median(stocks$debt_to_equity[stocks$ticker == "FNMA US Equity"], na.rm = TRUE)

# Finding the index of the maximum debt_to_equity value for FNMA US Equity
max_dte_idx <- which(stocks$ticker == "FNMA US Equity" &
                     stocks$debt_to_equity == max(stocks$debt_to_equity[stocks$ticker == "FNMA US Equity"], na.rm = TRUE))

# Imputing the maximum debt_to_equity value with the median 
stocks$debt_to_equity[max_dte_idx] <- median_dte
stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of debt_to_equity",
       x = "debt_to_equity") +
  theme_light()

stocks |>
  ggplot(aes(x = debt_to_equity)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of debt_to_equity",
       x = "debt_to_equity") +
  theme_light() + 
  theme(axis.text.x = element_blank())

The distribution still contains many outliers, but it is better than before.

stocks |>
  ggplot(aes(x = profitability)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of profitability",
       x = "profitability") +
  theme_light()

stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() + 
  theme(axis.text.x = element_blank())

cat("Percentage of negative profitability values:", (sum(stocks$profitability < 0) / nrow(stocks)) * 100, "%\n")
Percentage of negative profitability values: 13.72491 %

Given my limited domain knowledge, I did some research and confirmed that profitability can indeed be negative for some companies. Next, I check whether the extreme values are specific to one company.

cat("Company with the minimum profitability:", stocks$ticker[which.min(stocks$profitability)], "\n")
Company with the minimum profitability: GSS US Equity 
# Checking the profitability box plot for this company
gue_stocks <- subset(stocks, ticker == "GSS US Equity")

gue_stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() + 
  theme(axis.text.x = element_blank())

summary(gue_stocks)
    ticker               date                price           market_cap      
 Length:328         Min.   :1995-12-31   Min.   : 0.8309   Min.   :   9.717  
 Class :character   1st Qu.:2002-10-23   1st Qu.: 3.5119   1st Qu.: 123.003  
 Mode  :character   Median :2009-08-15   Median : 5.7125   Median : 324.404  
                    Mean   :2009-08-14   Mean   :12.0790   Mean   : 364.548  
                    3rd Qu.:2016-06-07   3rd Qu.:15.6438   3rd Qu.: 502.740  
                    Max.   :2023-03-31   Max.   :93.1250   Max.   :1347.599  
 price_to_book      debt_to_equity    profitability        volatility     
 Min.   :  0.3059   Min.   :   0.00   Min.   :-3300700   Min.   :  7.727  
 1st Qu.:  1.3339   1st Qu.:  15.56   1st Qu.:     -36   1st Qu.: 49.261  
 Median :  2.2096   Median :  29.25   Median :      -9   Median : 64.143  
 Mean   : 44.5658   Mean   : 215.95   Mean   : -126277   Mean   : 75.921  
 3rd Qu.:  7.5202   3rd Qu.: 239.56   3rd Qu.:       4   3rd Qu.: 94.133  
 Max.   :622.1912   Max.   :3187.23   Max.   :      93   Max.   :323.290  
    revenue           return         
 Min.   :  0.00   Min.   :-0.494444  
 1st Qu.: 12.72   1st Qu.:-0.133405  
 Median : 61.91   Median :-0.004281  
 Mean   : 83.58   Mean   : 0.018887  
 3rd Qu.:103.69   3rd Qu.: 0.117813  
 Max.   :550.54   Max.   : 1.694444  

Its profitability has some outliers. We will impute them with the median value.

# Calculating the median profitability for GSS US Equity
median_p <- median(stocks$profitability[stocks$ticker == "GSS US Equity"], na.rm = TRUE)

# Calculating the 15th percentile profitability for GSS US Equity
percentile_15 <- quantile(stocks$profitability[stocks$ticker == "GSS US Equity"], probs = 0.15, na.rm = TRUE)

# Finding the indices of profitability values below the 15th percentile
lower_15_idx <- which(stocks$ticker == "GSS US Equity" & stocks$profitability < percentile_15)

# Imputing the lower 15% profitability values with the median for GSS US Equity
stocks$profitability[lower_15_idx] <- median_p
# Checking the profitability box plot for this company
gue_stocks <- subset(stocks, ticker == "GSS US Equity")

gue_stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() + 
  theme(axis.text.x = element_blank())

# Performing the same process for "MUX US Equity"

# Calculating the median profitability for MUX US Equity
median_p <- median(stocks$profitability[stocks$ticker == "MUX US Equity"], na.rm = TRUE)

# Calculating the 10th percentile profitability for MUX US Equity
percentile_10 <- quantile(stocks$profitability[stocks$ticker == "MUX US Equity"], probs = 0.1, na.rm = TRUE)

# Finding the indices of profitability values below the 10th percentile
lower_10_idx <- which(stocks$ticker == "MUX US Equity" & stocks$profitability < percentile_10)

# Imputing the lower 10% profitability values with the median for MUX US Equity
stocks$profitability[lower_10_idx] <- median_p
# Checking the profitability box plot for this company
mux_stocks <- subset(stocks, ticker == "MUX US Equity")

mux_stocks |>
  ggplot(aes(x = profitability)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of profitability",
       x = "profitability") +
  theme_light() + 
  theme(axis.text.x = element_blank())

stocks |>
  ggplot(aes(x = volatility)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of volatility",
       x = "volatility") +
  theme_light()

stocks |>
  ggplot(aes(x = volatility)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  geom_vline(xintercept = quantile(stocks$volatility, probs = c(0.99991)), color = "red", linetype = "dashed") +
  labs(title = "Boxplot of volatility",
       x = "volatility") +
  theme_light() + 
  theme(axis.text.x = element_blank())

cat("Company with the max volatility:", stocks$ticker[which.max(stocks$volatility)], "\n")
Company with the max volatility: IMCI US Equity 
# Checking the volatility box plot for this company
iue_stocks <- subset(stocks, ticker == "IMCI US Equity")

iue_stocks |>
  ggplot(aes(x = volatility)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.5) +   
  labs(title = "Boxplot of volatility",
       x = "volatility") +
  theme_light() + 
  theme(axis.text.x = element_blank())

# Calculating the median volatility 
median_v <- median(stocks$volatility[stocks$ticker == "IMCI US Equity"], na.rm = TRUE)

# Calculating the 90th percentile volatility 
percentile_90 <- quantile(stocks$volatility[stocks$ticker == "IMCI US Equity"], probs = 0.9, na.rm = TRUE)

# Finding the indices of volatility values above the 90th percentile 
upper_10_idx <- which(stocks$ticker == "IMCI US Equity" & stocks$volatility > percentile_90)

# Imputing the upper 10% volatility values with the median for IMCI US Equity
stocks$volatility[upper_10_idx] <- median_v
stocks |>
  ggplot(aes(x = revenue)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + 
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of revenue",
       x = "revenue") +
  theme_light()

stocks |>
  ggplot(aes(x = revenue)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +  
  labs(title = "Boxplot of revenue",
       x = "revenue") +
  theme_light() + 
  theme(axis.text.x = element_blank())

stocks |>
  ggplot(aes(x = return)) +
  geom_histogram(binwidth = 0.05, fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) + # returns lie roughly in [-0.5, 1.7], so a binwidth of 10 would collapse them into one bar
  geom_density(aes(y = after_stat(count)), color = "#404080", linewidth = 1) +  
  labs(title = "Histogram of return",
       x = "return") +
  theme_light()

stocks |>
  ggplot(aes(x = return)) +
  geom_boxplot(fill = "#69b3a2", color = "#69b3a2", alpha = 0.7) +   
  labs(title = "Boxplot of return",
       x = "return") +
  theme_light() + 
  theme(axis.text.x = element_blank())

Checking if all tickers have the same frequency in the dataset.

# counting how many tickers belong to different frequencies
barplot(table(table(stocks$ticker)), main = "Distribution of Observations per Ticker", 
        xlab = "Number of observations per ticker", ylab = "Number of tickers",
        col = "skyblue", border = "black")

The majority of tickers have the full 328 monthly observations, while 29 tickers have shorter histories.

names(table(stocks$ticker)[table(stocks$ticker)==129])
[1] "HR US Equity"
summary(subset(stocks, ticker == "HR US Equity"))
    ticker               date                price         market_cap  
 Length:129         Min.   :2012-07-31   Min.   :18.38   Min.   :1970  
 Class :character   1st Qu.:2015-03-31   1st Qu.:24.08   1st Qu.:3151  
 Mode  :character   Median :2017-11-30   Median :26.96   Median :5381  
                    Mean   :2017-11-29   Mean   :26.62   Mean   :4902  
                    3rd Qu.:2020-07-31   3rd Qu.:29.42   3rd Qu.:6033  
                    Max.   :2023-03-31   Max.   :34.05   Max.   :9994  
 price_to_book    debt_to_equity   profitability       volatility    
 Min.   :0.9687   Min.   : 58.12   Min.   :-25.526   Min.   : 9.225  
 1st Qu.:1.7555   1st Qu.: 82.72   1st Qu.:  4.828   1st Qu.:15.853  
 Median :1.9354   Median : 89.64   Median :  7.954   Median :19.240  
 Mean   :1.9832   Mean   : 91.76   Mean   :  9.225   Mean   :21.609  
 3rd Qu.:2.2347   3rd Qu.:100.53   3rd Qu.: 10.893   3rd Qu.:23.957  
 Max.   :2.8958   Max.   :115.05   Max.   : 98.773   Max.   :92.440  
    revenue           return         
 Min.   : 73.47   Min.   :-0.220295  
 1st Qu.:102.05   1st Qu.:-0.035327  
 Median :172.30   Median : 0.002463  
 Mean   :218.81   Mean   : 0.001582  
 3rd Qu.:191.49   3rd Qu.: 0.045754  
 Max.   :932.64   Max.   : 0.125759  

For example, the “HR US Equity” series only starts in July 2012. I will keep this data rather than remove it.
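Rather than checking one count at a time (as with `table(stocks$ticker)==129` above), all tickers with incomplete histories can be listed in one go; `incomplete_tickers` below is a hypothetical helper, not part of the original notebook:

```r
# Hypothetical helper: list every group with fewer observations than the most
# complete group (here, tickers with less than the full 328 months).
incomplete_tickers <- function(tickers) {
  counts <- table(tickers)
  names(counts[counts < max(counts)])
}
# Usage on the dataset: incomplete_tickers(stocks$ticker)
```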

Multivariate analysis

Correlation analysis

# selecting variables
selected_vars <- c("price", "market_cap", "price_to_book", "debt_to_equity", "profitability", "volatility", "revenue", "return")

# Using Spearman correlation (rather than Pearson) because the data are not normally distributed and we cannot assume linear relationships between variables, especially as they are ticker dependent
correlation_matrix <- cor(stocks[, selected_vars], method = "spearman")

# Melting the correlation matrix for visualization
melted_correlation <- melt(correlation_matrix)

# Rounding the correlation coefficients to two decimal places
melted_correlation$value <- round(melted_correlation$value, 2)

# used gpt for the colors :)
my_palette <- c("#f7fbff", "#deebf7", "#c6dbef", "#9ecae1", "#6baed6", "#4292c6", "#2171b5", "#08519c", "#08306b")

ggplot(melted_correlation, aes(Var1, Var2, fill = value, label = value)) +
  geom_tile(color = "white") +
  geom_text(color = "black") +  # Add text labels
  scale_fill_gradientn(colors = my_palette, name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  coord_fixed()

The variables do not correlate strongly with the target variable, but several of them are strongly correlated with each other.
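To make that claim concrete, the strongly correlated pairs can be extracted programmatically from the matrix; `strong_pairs` below is a hypothetical helper, with 0.7 as an arbitrary threshold:

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds
# a threshold (upper triangle only, to avoid duplicate pairs).
strong_pairs <- function(corr_mat, threshold = 0.7) {
  idx <- which(abs(corr_mat) > threshold & upper.tri(corr_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(corr_mat)[idx[, 1]],
             var2 = colnames(corr_mat)[idx[, 2]],
             rho  = corr_mat[idx])
}
# Usage: strong_pairs(correlation_matrix, 0.7)
```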

Numerical vs categorical

Performing a Kruskal-Wallis test between the categorical variable ticker and the target variable return. Since the data are not normally distributed, we cannot use ANOVA.

kruskal_result <- kruskal.test(return ~ ticker, data = stocks)

# Print the test result
print(kruskal_result)

    Kruskal-Wallis rank sum test

data:  return by ticker
Kruskal-Wallis chi-squared = 988.51, df = 841, p-value = 0.0003113

The Kruskal-Wallis chi-squared statistic is 988.51 with 841 degrees of freedom, and since the p-value (0.0003) is below the 0.05 significance level, we reject the null hypothesis.

This indicates that there is evidence to suggest that the median returns differ across the ticker categories.

Hence, there is a statistically significant association between the “return” and “ticker” variables.
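The Kruskal-Wallis test only indicates that at least one ticker's return distribution differs. As a sketch of a possible follow-up, `stats::pairwise.wilcox.test` can locate which pairs differ, though with roughly 842 tickers this is only practical on a subset; `posthoc_pairwise` below is a hypothetical wrapper, not part of the original notebook:

```r
# Post-hoc sketch: pairwise Wilcoxon tests with Holm correction on a chosen
# subset of groups; all ~842 tickers at once would be computationally heavy.
posthoc_pairwise <- function(values, groups, method = "holm") {
  pairwise.wilcox.test(values, groups, p.adjust.method = method)
}
# Usage on a few arbitrarily chosen tickers:
# sub <- subset(stocks, ticker %in% c("AAON US Equity", "WMT US Equity", "XOM US Equity"))
# posthoc_pairwise(sub$return, sub$ticker)
```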

Time series analysis

# Aggregating returns by date
agg_returns <- stocks |>
  group_by(date) |> 
  summarise(total_return = sum(return, na.rm = TRUE))

# Plotting aggregate returns over time
gg <- ggplot(agg_returns, aes(date, total_return)) +
  geom_line(col = "blue") +
  labs(x = "Date", y = "Total Return", title = "Aggregate Returns Over Time")

# Convert ggplot to plotly object
ggplotly(gg, dynamicTicks = TRUE)

The aggregate returns fluctuate significantly over the observed period; sharp rises and falls indicate a high degree of volatility, with alternating stretches of positive and negative returns.

stocks$date <- as.Date(stocks$date)

# Note: this stacks the returns of all tickers into one series, so the ARIMA
# fit should be read as a rough exploratory summary rather than a per-ticker model
ts_data <- ts(stocks$return, frequency = 1)

arima_model <- auto.arima(ts_data)

summary(arima_model)
Series: ts_data 
ARIMA(2,0,2) with non-zero mean 

Coefficients:
          ar1      ar2     ma1     ma2    mean
      -0.0252  -0.8403  0.0062  0.8233  0.0119
s.e.   0.0183   0.0118  0.0194  0.0120  0.0002

sigma^2 = 0.01557:  log likelihood = 182601.3
AIC=-365190.6   AICc=-365190.6   BIC=-365127.4

Training set error measures:
                        ME     RMSE        MAE MPE MAPE      MASE         ACF1
Training set -3.237388e-07 0.124796 0.07939448 NaN  Inf 0.6855444 -0.005000474
plot(forecast(arima_model))

The non-zero mean indicates a long-term average effect that the model accounts for, which return series often exhibit. The low RMSE/MAE and near-zero first autocorrelation of the residuals suggest that the predictions are reasonable and that the residuals are mostly random, which is a desirable property (MPE and MAPE are undefined here because many returns are zero or cross zero).
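The near-zero ACF1 claim can be checked more formally. As a sketch, a Ljung-Box test on the residuals (via `stats::Box.test`, or `forecast::checkresiduals` for a plot plus test) asks whether any autocorrelation remains; `ljung_box_check` below is a hypothetical wrapper, not part of the original notebook:

```r
# Residual diagnostics sketch: a Ljung-Box test for leftover autocorrelation
# in a fitted ARIMA model's residuals; a large p-value supports the
# "residuals are mostly random" reading. fitdf should equal the number of
# estimated ARMA coefficients (p + q = 4 for the ARIMA(2,0,2) above).
ljung_box_check <- function(model, lag = 10, fitdf = 4) {
  Box.test(residuals(model), lag = lag, type = "Ljung-Box", fitdf = fitdf)
}
# Usage: ljung_box_check(arima_model)
```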

ggplot(stocks, aes(x = date, y = return, color = ticker)) +
  geom_line() +
  labs(x = "Date", y = "Return", title = "Return Over Date for different Tickers") +
  theme_minimal()

set.seed(42)  
random_tickers <- sample(unique(stocks$ticker), 3)

subset_data <- stocks %>% 
  filter(ticker %in% random_tickers)

# Plot
ggplot(subset_data, aes(x = date, y = return, color = ticker)) +
  geom_line() +
  labs(x = "Date", y = "Return", title = "Return Over Date for different tickers")

All three equities have periods of significant volatility, as evidenced by the sharp peaks and troughs. GSS US Equity, in green, shows particularly high volatility with several spikes surpassing a return of 1.0 and dipping below -0.5.

None of the equities show a clear long-term trend upwards or downwards; they all seem to fluctuate around a return of zero. This could imply that their prices have oscillated around a mean value, or that the returns have been adjusted to remove any trend.

There are several notable outliers, particularly for GSS US Equity, which may represent specific events that caused drastic increases or decreases in return on those dates.

Machine Learning

Label encoding ticker

stocks$ticker <- as.numeric(factor(stocks$ticker))

# Adding month and year variables
stocks$month <- month(stocks$date)
stocks$year <- year(stocks$date)
head(stocks)
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue return month year
1 1995-12-31 0.5048 35.1440 2.5948 85.9073 0.8628 71.728 14.720 -0.0980883 12 1995
1 1996-01-31 0.4719 32.8520 2.4256 85.9073 3.0722 63.087 67.346 -0.0651743 1 1996
1 1996-02-29 0.5048 35.1440 2.5948 85.9073 3.0722 97.639 67.346 0.0697182 2 1996
1 1996-03-31 0.4170 29.0367 2.0805 65.1878 3.1180 100.450 13.438 -0.1739303 3 1996
1 1996-04-30 0.3841 26.7444 1.9162 65.1878 3.1180 76.133 13.438 -0.0788969 4 1996
1 1996-05-31 0.3951 27.5445 1.9710 65.1878 3.1180 88.304 13.438 0.0286384 5 1996

Dataset split into train and test

# Now I will split the data set into train and test

threshold_date <- max(stocks$date) - 365  # holding out the final year of data for testing

df_train <- subset(stocks, date < threshold_date)
df_test <- subset(stocks, date >= threshold_date)
x_train <- df_train[, !names(df_train) %in% c("return", "date")]
y_train <- df_train$return
x_test <- df_test[, !names(df_test) %in% c("return", "date")]
y_test <- df_test$return
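Since the split is temporal, a quick sanity check that no training date overlaps the test window may be worthwhile; `check_temporal_split` below is a hypothetical helper, not part of the original notebook:

```r
# Hypothetical sanity check: a temporal split is leak-free only if every
# training date precedes every test date.
check_temporal_split <- function(train_dates, test_dates) {
  max(train_dates) < min(test_dates)
}
# Usage: stopifnot(check_temporal_split(df_train$date, df_test$date))
```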

Feature engineering

# Apply z-score normalization by ticker for training data
x_train_scaled <- x_train %>%
  group_by(ticker) %>%
  mutate(across(.cols = everything(), .fns = z_score_normalization)) %>%
  ungroup()

# Apply z-score normalization by ticker for testing data
x_test_scaled <- x_test %>%
  group_by(ticker) %>%
  mutate(across(.cols = everything(), .fns = z_score_normalization)) %>%
  ungroup()
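One caveat with `z_score_normalization`: if a column is constant within a ticker (sd = 0), it returns NaN, and scaling the test set with its own statistics uses information from the test period. A safer variant might look like this (a sketch; `z_score_safe` is hypothetical and not what the notebook uses):

```r
# Safe z-score: falls back to 0 when the standard deviation is 0 or NA, and
# allows centering/scaling with externally supplied (training-set) statistics
# to avoid leaking test-period information.
z_score_safe <- function(x, center = mean(x, na.rm = TRUE), scale = sd(x, na.rm = TRUE)) {
  if (is.na(scale) || scale == 0) return(rep(0, length(x)))
  (x - center) / scale
}
```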

Modelling

First model

Setting Up Parameters to train
train_params <- list(objective = "regression", 
                     boosting_type = "gbdt",
                     boost_from_average = "false",
                     learning_rate = 0.005, # Learning rate
                     num_leaves = 192, # Max nb leaves in tree
                     min_data_in_leaf = 100,
                     feature_fraction = 0.3, # % of features
                     bagging_freq = 1,
                     bagging_fraction = 0.7, # % of observations
                     lambda_l1 = 0,
                     lambda_l2 = 0)
Training the model
bst_k <- lightgbm(
  data = as.matrix(x_train_scaled),
  label = y_train, # Target / label
  params = train_params,        # Passing parameter values
  nrounds = 2000 ,     # Number of boosting rounds
  eval_freq = 20,  # Evaluate the model every 20 rounds
  verbose = -1  # Disable verbose output
)
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
# Calculate predictions
predictions <- predict(bst_k, as.matrix(x_test_scaled))

# Calculate RMSE
rmse <- sqrt(mean((predictions - y_test)^2))

# Print RMSE
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 0.1171035 
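To judge whether an RMSE of 0.117 is actually informative, it can help to compare it with a naive baseline that always predicts the training-set mean return; `baseline_rmse` below is a hypothetical helper, not part of the original notebook:

```r
# Naive baseline sketch: RMSE of always predicting the training-set mean.
# If the model's RMSE is not clearly below this, the features add little.
baseline_rmse <- function(y_train, y_test) {
  sqrt(mean((mean(y_train) - y_test)^2))
}
# Usage: baseline_rmse(y_train, y_test)
```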

Let’s use that last model and perform a cross validation

Cross validation

cv_model <- lgb.cv(
  params = train_params,
  data = as.matrix(x_train_scaled),
  label = y_train, # Target / label
  eval_freq = 80,
  nrounds = 3,                  
  nfold = 5
)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002948 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211932, number of used features: 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002987 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211931, number of used features: 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003047 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211931, number of used features: 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002697 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211931, number of used features: 10
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002819 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2133
[LightGBM] [Info] Number of data points in the train set: 211931, number of used features: 10
[1]:  valid's l2:0.0157944+0.000533276 
[3]:  valid's l2:0.0157841+0.000533216 
cv_model$record_evals
$start_iter
[1] 1

$valid
$valid$l2
$valid$l2$eval
$valid$l2$eval[[1]]
[1] 0.01579436

$valid$l2$eval[[2]]
[1] 0.01579027

$valid$l2$eval[[3]]
[1] 0.01578412


$valid$l2$eval_err
$valid$l2$eval_err[[1]]
[1] 0.0005332762

$valid$l2$eval_err[[2]]
[1] 0.0005333652

$valid$l2$eval_err[[3]]
[1] 0.0005332158

Lower values of the l2 loss indicate better performance in minimizing the squared differences between predicted and actual values. A smaller l2 evaluation error indicates less variability in the l2 loss across folds.

The model's performance, as measured by the l2 loss, is consistent across the five folds, with relatively low variability in the evaluation error. This consistency indicates stable predictive performance.
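One caveat when comparing numbers: `lgb.cv` reports the l2 loss (a mean squared error), while the test metric above is an RMSE. Taking the square root puts the two on the same scale:

```r
# The mean validation l2 from the final CV round, converted to an RMSE for
# comparison with the test-set RMSE reported earlier.
cv_l2 <- 0.0157841   # valid's l2 at round 3 of lgb.cv
cv_rmse <- sqrt(cv_l2)
round(cv_rmse, 4)    # about 0.1256
```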

Hyper parameter tuning

We set up the first combinations

num_leaves <- c(5,30)
learning_rate <- c(0.01, 0.05, 0.2)
pars <- expand.grid(num_leaves, learning_rate)
num_leaves <- pars[,1]
learning_rate <- pars[,2]

Next, the training function. Note that some parameters are flexible, others are fixed.

train_func <- function(num_leaves, learning_rate, x_train_scaled){
  train_params <- list(             # First, the list of params
    num_leaves = num_leaves,        # Max nb leaves in tree
    learning_rate = learning_rate,  # Learning rate
    objective = "regression",           # Loss function
    max_depth = 3,                  # Max depth of trees
    min_data_in_leaf = 50,          # Nb points in leaf
    bagging_fraction = 0.5,         # % of observations
    feature_fraction = 0.7,         # % of features
    nthread = 4,                    # Parallelization
    force_row_wise = T
  )
  
  # Next we train
  bst <- lightgbm(
    data = as.matrix(x_train_scaled),
    label = y_train, # Target / label
    params = train_params,        # Passing parameter values
    eval_freq = 50,
    nrounds = 1000,
    verbose = -1
  )
  # Next, we record the final loss. Note: with a regression objective there is
  # no binary_logloss entry in record_evals, so this lookup returns NULL
  return(loss = bst$record_evals$train$binary_logloss$eval[[10]]) 
}
train_func(10, 0.1, x_train_scaled) # Testing
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
NULL
The lookup returned NULL because a regression model records the l2 loss, not binary_logloss. The corrected version below computes the training RMSE explicitly and also returns the fitted model.

train_func <- function(num_leaves, learning_rate, x_train_scaled, y_train) {
  train_params <- list(
    num_leaves = num_leaves,
    learning_rate = learning_rate,
    objective = "regression",
    max_depth = 3,
    min_data_in_leaf = 50,
    bagging_fraction = 0.5,
    feature_fraction = 0.7,
    nthread = 4,
    force_row_wise = TRUE
  )
  
  bst <- lightgbm(
    data = as.matrix(x_train_scaled),
    label = y_train,
    params = train_params,
    eval_freq = 50,
    nrounds = 1000,
    verbose = -1
  )
  
  # Calculate Mean Squared Error (MSE)
  mse_loss <- mean((y_train - predict(bst, newdata = as.matrix(x_train_scaled)))^2)
  
  # Convert MSE to RMSE
  rmse_loss <- sqrt(mse_loss)
  
  # Return the loss value and the trained model
  return(list(loss = rmse_loss, model = bst))
}

# Call the function with appropriate arguments
result <- train_func(10, 0.1, x_train_scaled, y_train)
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
print(result)
$loss
[1] 0.1151554

$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns

The RMSE is 0.115, close to that of our first model bst_k (0.117).

Finally, we can launch a function that is going to span all free parameters.

grd <- pmap(list(num_leaves = num_leaves, learning_rate = learning_rate),  # Parameter grid to span
            ~ train_func(..1, ..2, x_train_scaled, y_train))               # Data and labels are fixed across runs
Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.

Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.

Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.

Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.

Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.

Warning in .get_default_num_threads(): Optional package 'RhpcBLASctl' not
found. Detection of CPU cores might not be accurate.
grd
[[1]]
[[1]]$loss
[1] 0.1214682

[[1]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns


[[2]]
[[2]]$loss
[1] 0.1209835

[[2]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns


[[3]]
[[3]]$loss
[1] 0.1180075

[[3]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns


[[4]]
[[4]]$loss
[1] 0.1170967

[[4]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns


[[5]]
[[5]]$loss
[1] 0.1139586

[[5]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns


[[6]]
[[6]]$loss
[1] 0.1129043

[[6]]$model
LightGBM Model (1000 trees)
Objective: regression
Fitted to dataset with 10 columns
gr <- bind_cols(as.data.frame(pars), as_tibble(map_dbl(grd, "loss")))
gr
Var1 Var2 value
5 0.01 0.1214682
30 0.01 0.1209835
5 0.05 0.1180075
30 0.05 0.1170967
5 0.20 0.1139586
30 0.20 0.1129043

The grid search tells us that the best hyperparameter combination is 30 leaves with a learning rate of 0.20, for a training RMSE of 0.113.
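Rather than reading the best combination off the table by eye, it can be selected programmatically; `best_row` below is a hypothetical helper, not part of the original notebook:

```r
# Hypothetical helper: return the grid row with the smallest loss.
best_row <- function(grid_df, loss_col = "value") {
  grid_df[which.min(grid_df[[loss_col]]), ]
}
# Usage: best_row(gr)
```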

Let's calculate the test-set predictions again. Note that result$model was trained with num_leaves = 10 and learning_rate = 0.1, not with the best grid combination.

# Calculate predictions
trained_model <- result$model
predictions <- predict(trained_model, as.matrix(x_test_scaled))

# Calculate RMSE
rmse <- sqrt(mean((predictions - y_test)^2))

# Print RMSE
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 0.1159798 

Model interpretability

Feature Importances
importance <- lgb.importance(trained_model)

ggplot(importance, aes(x = Gain, y = reorder(Feature, Gain))) +
  geom_col(fill = "#22AABB", alpha = 0.7) +
  theme_bw() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_text(size = 6),
        plot.title = element_text(size = 10)) +
  coord_cartesian(clip = "off", ylim = c(0, length(importance$Feature))) +
  labs(title = "Feature Importance") +
  theme(plot.margin = margin(10, 30, 10, 10))

Conclusion

In conclusion, the year and month variables are highly important, which is consistent with the financial nature of the data and its temporal dependency. Volatility also shows a strong association with fluctuations in returns, and the stock's identity (its ticker) and market capitalization substantially influence the outcomes.

However, because several variables are highly correlated, their individual importance diminishes: much of their information is already captured by other variables. Moving forward, models such as Neural Prophet could offer promising avenues for further analysis and predictive modeling, potentially yielding deeper insights into the dataset.